Sampling Strategies and Learning Efficiency in Text Categorization
Abstract
This paper studies training set sampling strategies in the context of statistical learning for text categorization. It is argued that sampling strategies favoring common categories are superior to uniform coverage or mistake-driven approaches if performance is measured by globally assessed precision and recall. The hypothesis is empirically validated by examining the performance of a nearest neighbor classifier on training samples drawn from a pool of 235,401 training texts with 29,741 distinct categories. The learning curves of the classifier are analyzed with respect to the choice of training resources, the sampling methods, the size, vocabulary, and category coverage of a sample, and the category distribution over the texts in the sample. Nearly optimal categorization performance is achieved using a relatively small training sample, showing that statistical learning can be successfully applied to very large text categorization problems with affordable computation.
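The contrast drawn in the abstract can be illustrated with a small sketch. It is not the paper's procedure: the pool layout (pairs of document id and category set), the function names, and the scoring rule (rank a document by the pool frequency of its most common category) are all assumptions made for illustration. "Globally assessed" precision and recall is read here as micro-averaging, that is, pooling all category assignments before computing the ratios, which is also an assumption.

```python
from collections import Counter

def sample_favoring_common(pool, k):
    """Return k documents, preferring those labeled with frequent categories.

    pool: list of (doc_id, set_of_categories) pairs (hypothetical layout).
    """
    # Count how many pool documents carry each category.
    freq = Counter(c for _, cats in pool for c in cats)

    # Score a document by its most frequent category; unlabeled ones score 0.
    def score(doc):
        return max((freq[c] for c in doc[1]), default=0)

    # sorted() is stable, so ties keep their original pool order.
    return sorted(pool, key=score, reverse=True)[:k]

def category_coverage(sample):
    """Number of distinct categories that appear in the sample."""
    return len({c for _, cats in sample for c in cats})

def micro_precision_recall(true_sets, pred_sets):
    """Precision and recall over all (document, category) decisions pooled."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    n_pred = sum(len(p) for p in pred_sets)
    n_true = sum(len(t) for t in true_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall

# Toy pool: category "a" is common, "b" and "c" are rare.
pool = [("d1", {"a"}), ("d2", {"a", "b"}), ("d3", {"c"}), ("d4", {"a"})]
sample = sample_favoring_common(pool, 3)
print([doc_id for doc_id, _ in sample], category_coverage(sample))
```

On this toy pool the sample keeps the three documents labeled with the common category and misses the rare category "c" entirely. Under pooled scoring, rare categories contribute few decisions, which is the intuition behind favoring common categories; a uniform-coverage strategy would instead spend part of the sample on "c".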
Similar resources
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods for searching, filtering, and managing resources are in high demand. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
The Effect of E-Learning Readiness on Self-Regulated Learning Strategies and Students’ Behavioral Tendency to Web-based Learning: The Mediating Role of Motivational Beliefs
Introduction: The aim of this study was to investigate the effect of e-learning readiness on self-regulated learning strategies and students' behavioral tendency to learn on the web. Method: The present study is applied in terms of its purpose and descriptive-correlational in terms of its nature and method. The participants of this study were all the students of Tabriz Payam-e-Noor Univ...
Predictive Self-Organizing Networks for Text Categorization
This paper introduces a class of predictive self-organizing neural networks known as Adaptive Resonance Associative Map (ARAM) for classification of free-text documents. Whereas most statistical approaches to text categorization derive classification knowledge based on training examples alone, ARAM performs supervised learning and integrates user-defined classification knowledge in the form of IF-T...
A Model for Extracting Information from Text Documents, Based on Text Mining in the E-Learning Domain
As computer networks become the backbones of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...
Publication date: 1996